-
Notifications
You must be signed in to change notification settings - Fork 227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Save and parse Arkane species with all statmech properties using YAML format #1402
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1402 +/- ##
==========================================
- Coverage 42.07% 42.02% -0.05%
==========================================
Files 165 165
Lines 27824 27867 +43
Branches 5668 5680 +12
==========================================
+ Hits 11706 11711 +5
- Misses 15332 15371 +39
+ Partials 786 785 -1
Continue to review full report at Codecov.
|
39614f3
to
7ac6569
Compare
If this is going to be a brand new file type, I strongly suggest using an established format such as yaml, which you can parse using pyyaml to give a python dict. Using exec on user provided input files is a security risk since we could be executing arbitrary code. It would be best to convert all of our input files and database files to not require exec, but it'll take time. Regarding the idea of having a database of statmech info, it would again be best if it's not a plain-text database like RMG-database is currently. Kehang's thermo central database uses mongoDB, which is an option, especially since it seems like it would overlap or be a part of this future statmech database. There are also tons of other options, but we should definitely put some thought into the database design. |
I changed the format into |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like where this is going! I made a few comments, feel free to push back on any if you have other thoughts.
rmgpy/cantherm/common.py
Outdated
from rmgpy.molecule.translator import toInChI, toInChIKey | ||
from rmgpy.thermo import NASA, NASAPolynomial, Wilhoit, ThermoData | ||
from rmgpy.statmech.conformer import Conformer | ||
from rmgpy.statmech import IdealGasTranslation, LinearRotor, NonlinearRotor, HarmonicOscillator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some of these imports are unused, so you could probably remove them.
Also, imports should be grouped by standard imports, third-party libraries, then local imports (PEP 8). It's not that important, but something to consider. For example:
import logging
import os.path
import re
import time
import numpy
import yaml
from yaml import ...
import rmgpy.constants as constants
from rmgpy import ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for informing me of the structure, I'll adhere to it.
The imports here are being used implicitly by the eval()
function which creates objects from these classes if they are present in the .yml file, so I think we should leave them (otherwise it errors). Since advanced text editors mark them as unused, I think it's a good idea to add a comment next to them explaining why they are kept. I'll happily reorganize imports across Cantherm.
rmgpy/cantherm/common.py
Outdated
' kJ/mol. This can cause significant errors in your computed rate constants. ') | ||
|
||
|
||
def determine_molecular_weight(species): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of creating this method, you should modify the getter for Species.molecularWeight
.
First, it would be more convenient for me if you could update the syntax to match my PR (this commit). Then, you can change the get method:
@property
def molecularWeight(self):
if self._molecularWeight is None:
self._molecularWeight = quantity.Mass(self.molecule[0].getMolecularWeight(), 'kg/mol')
return self._molecularWeight
Once you do that, you can access the Species.molecularWeight
property regardless of whether or not it's been set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment!
I'll implement it in the Species __init__()
:
def __init__(self, ..., molecularWeight=None, ...):
if self.molecularWeight is None and self.molecule is not None and len(self.molecule) > 0:
self._molecularWeight = quantity.Mass(self.molecule[0].getMolecularWeight(), 'kg/mol')
else:
self.molecularWeight = molecularWeight
Shall I still change the syntax of molecularWeight
? I can leave it untouched so #1329 will have less conflicts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second though, your recommendation seems better if for some reason an empty Species object is first created, then a structure is assigned.
rmgpy/cantherm/common.py
Outdated
d['thermo'] = species.thermo | ||
d['chemkin_thermo_string'] = chemkin_thermo_string | ||
|
||
self.species_dictionary = d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be called species_dictionary
. I also think that you could save all of these parameters as normal attributes, then create a method as_dictionary
if you need them as a dictionary.
If you do that, I think you also wouldn't need the update
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
I think I still need something like the update() method, though, to only append the species attributes to this object (the object is initiated with non-species-specific attributes such as author
and level_of_theory
which are kept)
rmgpy/cantherm/data/HSOO.yml
Outdated
SMILES: '[O]OS' | ||
adjacencyList: 'multiplicity 2 | ||
|
||
1 S u0 p2 c0 {2,S} {4,S} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know why there are extra blank lines in the adjlist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why, it bugs me as well. But I verified that the adjList is being read correctly.
If I simply print it during the run (either using print
or logging.info
) it looks correct with no extra line brakes.
rmgpy/cantherm/data/HSOO.yml
Outdated
transport_data: TransportData(epsilon=(3625.11,'J/mol'), sigma=(5.7,'angstrom')) | ||
use_atom_corrections: true | ||
use_bond_corrections: false | ||
use_hindered_rotors: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of these parameters don't seem like they would be important to store, e.g. use_atom_corrections
, use_bond_corrections
, use_hindered_rotors
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least use_bond_corrections
is crucial, the rest are just "nice to have".
use_bond_corrections
: This might impact the decision of the user of whether or not to use this file. If the user wants to use this species in a PES with other species calculated at different levels of theory, it is important/recommended to use bond corrections. On the other hand, if all species were calculated at the same level, it is better practice not to use bond corrections (which don't capture all bond types, and are often ambiguous for TSs).
use_hindered_rotors
: This could also be directly inferred from the Conformer object. However, for a quick assessment, it's nice to have it explicitly written.
use_atom_corrections
: Can be inferred from the atom_energies
attribute, I'm OK with removing it since using this feature is less common.
rmgpy/cantherm/input.py
Outdated
spec.conformer = Conformer(E0=E0, modes=modes, spinMultiplicity=spinMultiplicity, opticalIsomers=opticalIsomers) | ||
if molecularWeight is None: | ||
if is_pdep(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this work if the species block comes before the pdep block?
Also, if you refactor the molecular weight implementation based on my other comment, would that eliminate the need for this if block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modified
rmgpy/cantherm/input.py
Outdated
""" | ||
Load the species with all statMech data from the .yml file | ||
""" | ||
global jobList |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't seem like you use jobList in this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
rmgpy/cantherm/input.py
Outdated
coordinates = np.ndarray(shape=(len(number), 3)) | ||
for i, coorlist in enumerate(job.species.conformer.coordinates.value_si): | ||
coordinates[i] = [coor for coor in coorlist] | ||
mol = Molecule().fromXYZ(atomicNums=number, coordinates=coordinates) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This section seems oddly coded. Was it done intentionally? It seems like the following would be sufficient:
number = job.species.conformer.number.value_si.tolist()
coordinates = job.species.conformer.coordinates.value_si.tolist()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the issue here is that we must pass a np.ndarray
type to Molecule().fromXYZ()
The data generated from .tolist()
throws an error.
I'll be happy to consider a different alternative to the non straight forward code I used, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I misinterpreted it, I thought initially that the conformer attributes were np.ndarray
objects, and you were converting them into lists.
If the conformer attributes are np.ndarray
, it seems you could use them directly. If they're lists, then couldn't you use np.array(list_input)
to convert it to an np.ndarray
?
rmgpy/cantherm/statmech.py
Outdated
filename, file_extension = os.path.splitext(self.path) | ||
if 'yml' in file_extension or 'yaml' in file_extension: | ||
self.load_species_from_database = True | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like you could just call load_species_from_database_file
here instead of main.py
, and it would also eliminate the need for the load_species_from_database
attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, will change accordingly
rmgpy/cantherm/thermo.py
Outdated
@@ -188,6 +188,30 @@ def save(self, outputFile): | |||
f.write('\n') | |||
f.close() | |||
|
|||
return chem_string | |||
|
|||
def save_species_to_database_file(self, path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may be more appropriate as a method of CanthermSpecies
. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree
@mliu49, Thanks for all the valuable comments! |
@mliu49, can I squash this branch? |
Yeah, you can go ahead and squash. I guess this might get replaced by the new method though, if that turns out ok. |
rmgpy/rmg_obj.py
Outdated
############################################################################### | ||
import numpy as np | ||
|
||
class RMG_Obj(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple comments on naming:
- Class names should be CamelCase, though I think we can make an exception for RMG. I would probably call this
RMGObject
. - Module names are ideally one word. Perhaps something like
object
,base
, orcommon
? - Variable names should be lowercase. I think
D
should be renamed to something else, likeoutput
or evenoutput_dict
.
Also, could you end the file with a newline?
I don't think we discussed how to handle Quantity objects with the new dict approach - we would need some way to save and load the units. |
They're our own objects right? So can't we just make them inherit from RMGObject? |
rmgpy/rmgobject.py
Outdated
def __init__(self): | ||
pass | ||
def as_dict(self): | ||
D = self.__dict__.copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this addition, @mjohnson541 !
I think that self.__dict__
contains methods as well, and we should first check if not callable(value)
before adding entries to the output dictionary.
322feb1
to
4bfbbb8
Compare
I rewrote this branch using the I'll note that since cythonized classes cannot inherit from non-cythonized classes, and vice versa, I had to make two parent classes: |
b132315
to
ad4e8c9
Compare
Are you sure that Python classes can't subclass Cython classes? It should be possible according to the Cython documentation (section 3.5). |
rebased. @mliu49, let me know when I can squash. |
@mliu49, let me know when to squash the fixups to finalize the PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a few more comments. You can go ahead and squash everything else.
arkane/input.py
Outdated
" cannot be reconstructed.".format(spec.label)) | ||
if spec.molecularWeight is None and is_pdep(jobList): | ||
# If one of the jobs is pdep and no molecular weight is given or calculated, raise an error | ||
raise ValueError("No molecularWeight was entered for species {0}. Since a structure wasn't given" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, my suggestion might not have been correct, since this now misses the case where a structure is not provided. One possibility:
if molecularWeight is not None:
spec.molecularWeight = molecularWeight
elif spec.molecularWeight is None and is_pdep(jobList):
raise ValueError()
There shouldn't be any need to check whether structure exists, because the the Species.molecularWeight property checks if it has a molecule attribute, which will be empty if structure was empty.
rmgpy/rmgobject.pyx
Outdated
@@ -1,10 +1,11 @@ | |||
# cython: embedsignature=True, cdivision=True | |||
#!/usr/bin/env python | |||
# -*- coding: utf-8 -*- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually isn't necessary either, since this is technically not a python file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean the .pyx or the .pxd file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both
arkane/qchem.py
Outdated
@@ -273,12 +273,13 @@ def loadEnergy(self, frequencyScaleFactor=1.): | |||
""" | |||
E0 = None | |||
with open(self.path, 'r') as f: | |||
for line in f: | |||
lines = f.readlines() | |||
for line in lines: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original implementation is preferable because it treats the file as an iterator, so only one line is loaded at a time. If you use readlines(), then the entire file has to be loaded into memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I think this isn't related to this PR (sorry) so I removed it, we could discuss it at another time (I think it might be necessary for running Arkane through ARC). Thanks for bringing this up, I removed the commit.
For ScalarQuantity and ArrayQuantity
Classes modified: Conformer, SingleExponentialDown, Mode, TransportData
simply by providing the structure
Along with its .yml file
- Added YAML description for the species() block - Added levelOfTheory and author descriptions
OK, done and squashed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
We'd like to be able to conveniently use previously calculated statmech data in Arkane (prev. Cantherm). The purpose is to have an online repository of such files, from which users can pick the desired species at the appropriate level of theory (if available). This PR sets the ground by allowing Arkane to save and parse such files.